Introduction

This notebook is the second delivery for the Data Valorization course at Université Côte d'Azur, Nice (2023). You can watch the explanation of the project's aim and the data cleaning here.

We will perform the regression with the following models, looking to obtain the lowest RMSE: linear regression, SVR, regression trees, and random forest.

Authors

Tymoteusz Igor Ciesielski

Lucia Trillo Carreras

Ewa Kupczyk

Marina Bueno Garcia

Libraries

options(warn=-1) # We want to generate a nice-looking notebook without warnings 
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.1     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(plotly) #interactive visualisation
## 
## Attaching package: 'plotly'
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following object is masked from 'package:graphics':
## 
##     layout
library(lattice) # data exploration (correlations)
library(dplyr)
library(countrycode) # For adding continent variable
library(corrplot)
## corrplot 0.92 loaded
library(rpart)
library(rpart.plot)
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## 
## The following object is masked from 'package:dplyr':
## 
##     combine
## 
## The following object is masked from 'package:ggplot2':
## 
##     margin
library(caTools)
library(MLmetrics)
## 
## Attaching package: 'MLmetrics'
## 
## The following object is masked from 'package:base':
## 
##     Recall
library(e1071)

Data

The salary dataset we use was scraped from Levels.fyi, a US-based online service for comparing compensation across tech companies. It contained 62,642 samples and 29 variables, predominantly categorical. In the first delivery we performed the preprocessing and one-hot encoding.

We import our cleaned dataset from the first delivery:

df_onehot <- read.csv('data/clean_salary_data.csv')
df_onehot <- na.omit(df_onehot) # Ensure we don't have NA values

We will feed our regression models with the one-hot encoded variables:

head(df_onehot) # Preview the encoded data
dim(df_onehot)
## [1] 21589    37

We define our output variable Y, totalyearlycompensation (how much a person earns over a whole year), and remove it from the feature matrix:

Y <- df_onehot[c("totalyearlycompensation")]
df_onehot <- subset(df_onehot, select = -c(totalyearlycompensation))

We split the data into training and testing sets:

set.seed(123)
trainIndex <- sample(1:nrow(df_onehot), 0.7 * nrow(df_onehot))
Ytrain <- Y[trainIndex,]
trainData <- df_onehot[trainIndex,]
Ytest <- Y[-trainIndex,]
testData <- df_onehot[-trainIndex,]
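Replaying the split logic on a toy frame confirms it behaves as intended, a 70/30 partition with no overlap (our own sanity-check sketch, not part of the original pipeline):

```r
# Sanity check of the sampling-based split on a small synthetic frame
set.seed(123)
toy <- data.frame(x = rnorm(100), y = rnorm(100))
idx <- sample(1:nrow(toy), 0.7 * nrow(toy))
toy_train <- toy[idx, ]
toy_test  <- toy[-idx, ]

nrow(toy_train)  # 70 rows
nrow(toy_test)   # 30 rows
```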

Linear Regression

Fit the linear regression model on the training data and predict on the testing data:

linearModel <- lm(Ytrain ~ ., data = trainData)
predictions <- predict(linearModel, newdata = testData)

We evaluate the model by plotting the residuals and computing the RMSE:

residuals <- Ytest - predictions
plot(residuals ~ predictions, xlab = "Predictions", ylab = "Residuals",
     main = "Residuals vs. Predictions")
abline(h = 0, col = "red")

rmse <- sqrt(mean(residuals^2))
R2_LR <- 1 - sum((residuals)^2) / sum((Ytest - mean(Ytest))^2)

cat("Root Mean Squared Error of Linear Regression:", rmse, "\n")
## Root Mean Squared Error of Linear Regression: 91757.71
cat("R-squared coefficient of Linear Regression:", R2_LR, "\n")
## R-squared coefficient of Linear Regression: 0.4600771
summary(Y)
##  totalyearlycompensation
##  Min.   :  10000        
##  1st Qu.: 119000        
##  Median : 173000        
##  Mean   : 197425        
##  3rd Qu.: 245000        
##  Max.   :2372000
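The same two metrics recur for every model below; wrapping them once keeps the computations consistent (our own helper sketch; the notebook computes the expressions inline):

```r
# RMSE and R-squared helpers mirroring the inline metric computations
calc_rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
calc_r2   <- function(actual, predicted) {
  1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)
}

# A perfect prediction gives RMSE 0 and R-squared 1
calc_rmse(c(1, 2, 3), c(1, 2, 3))  # 0
calc_r2(c(1, 2, 3), c(1, 2, 3))    # 1
```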

Finally, we analyze the model:

summary(linearModel)
## 
## Call:
## lm(formula = Ytrain ~ ., data = trainData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -365054  -50086  -13849   30283 1834341 
## 
## Coefficients: (4 not defined because of singularities)
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         23557.2   108953.7   0.216 0.828825    
## ed_bachelor        -50609.2    54090.8  -0.936 0.349477    
## ed_master          -35509.7    54082.9  -0.657 0.511462    
## ed_doctor           20955.4    54248.0   0.386 0.699288    
## race_asian          27628.4    93675.7   0.295 0.768046    
## race_white          25232.1    93692.3   0.269 0.787696    
## race_hispanic       15116.0    93619.9   0.161 0.871732    
## race_black          16926.8    93786.4   0.180 0.856776    
## race_two_or_more    34239.4    93773.8   0.365 0.715021    
## yearsofexperience    8944.3      164.0  54.554  < 2e-16 ***
## yearsatcompany      -1786.8      278.7  -6.412 1.48e-10 ***
## ed_other           -60383.4    54296.5  -1.112 0.266111    
## gender_female      -32165.6    10861.9  -2.961 0.003068 ** 
## gender_male        -20540.7    10733.0  -1.914 0.055667 .  
## gender_other             NA         NA      NA       NA    
## com_oracle          41656.1     8101.7   5.142 2.76e-07 ***
## com_other           46423.1     5973.4   7.772 8.25e-15 ***
## com_amazon          81059.7     6303.2  12.860  < 2e-16 ***
## com_apple          117429.4     7487.4  15.684  < 2e-16 ***
## com_salesforce      87244.8     8561.3  10.191  < 2e-16 ***
## com_facebook       188222.5     6889.5  27.320  < 2e-16 ***
## com_google         134166.3     6664.9  20.130  < 2e-16 ***
## com_intel           24415.6     8610.3   2.836 0.004580 ** 
## com_microsoft       58023.6     6496.9   8.931  < 2e-16 ***
## com_ibm                  NA         NA      NA       NA    
## title_pmanager      35163.6     5169.3   6.802 1.07e-11 ***
## title_sweng         15427.7     4321.0   3.570 0.000358 ***
## title_swengman      76961.9     5554.7  13.855  < 2e-16 ***
## title_datasci        9571.3     5652.9   1.693 0.090442 .  
## title_other        -17903.6     4644.4  -3.855 0.000116 ***
## title_hweng              NA         NA      NA       NA    
## continent_namerica  94840.0     3054.0  31.054  < 2e-16 ***
## continent_samerica -47260.8    16273.1  -2.904 0.003687 ** 
## continent_africa    34400.3    31467.3   1.093 0.274321    
## continent_oceania   48670.2    11198.2   4.346 1.39e-05 ***
## continent_asia     -30414.1     3989.2  -7.624 2.60e-14 ***
## continent_europe         NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 93600 on 15079 degrees of freedom
## Multiple R-squared:  0.4497, Adjusted R-squared:  0.4485 
## F-statistic:   385 on 32 and 15079 DF,  p-value: < 2.2e-16

SVR

Support Vector Regression (SVR) is the regression counterpart of the support vector machine, a supervised learning algorithm: SVMs handle classification, while SVR predicts continuous outputs.

We train six models, combining three kernels (linear, radial, polynomial) with two regression types (nu and eps):

svmModel_nu <- svm(Ytrain ~ ., data = trainData, kernel = "linear", type = "nu-regression")

summary(svmModel_nu)
## 
## Call:
## svm(formula = Ytrain ~ ., data = trainData, kernel = "linear", type = "nu-regression")
## 
## 
## Parameters:
##    SVM-Type:  nu-regression 
##  SVM-Kernel:  linear 
##        cost:  1 
##          nu:  0.5 
## 
## Number of Support Vectors:  7576
coef(svmModel_nu)
##        (Intercept)        ed_bachelor          ed_master          ed_doctor 
##       -0.097563025       -0.038727556        0.012030658        0.084853103 
##         race_asian         race_white      race_hispanic         race_black 
##        0.014890844       -0.007536994       -0.014142729       -0.014134388 
##   race_two_or_more  yearsofexperience     yearsatcompany           ed_other 
##        0.009076933        0.354937948       -0.052757125       -0.022627934 
##      gender_female        gender_male       gender_other         com_oracle 
##       -0.011780481        0.010580065        0.005820983       -0.023680391 
##          com_other         com_amazon          com_apple     com_salesforce 
##       -0.114341764        0.027683795        0.057420485        0.015392605 
##       com_facebook         com_google          com_intel      com_microsoft 
##        0.156081922        0.104911952       -0.038831414       -0.021549315 
##            com_ibm     title_pmanager        title_sweng     title_swengman 
##       -0.067304670        0.031211252        0.005795747        0.088753323 
##      title_datasci        title_other        title_hweng continent_namerica 
##       -0.005109305       -0.069320591       -0.012717556        0.160482699 
## continent_samerica   continent_africa  continent_oceania     continent_asia 
##       -0.027190088       -0.001593356        0.009465945       -0.149976481 
##   continent_europe 
##       -0.061930428
# Predict with the SVM model on the testing set
svmPred_nu <- predict(svmModel_nu, testData)

# Evaluate the model with RMSE
RMSE_nu <- sqrt(mean((svmPred_nu - Ytest)^2))
cat("\n Root Mean Squared Error:", RMSE_nu, "\n")
## 
##  Root Mean Squared Error: 93505.17
svmModel_eps <- svm(Ytrain ~ ., data = trainData, kernel = "linear", type="eps-regression")
summary(svmModel_eps)
## 
## Call:
## svm(formula = Ytrain ~ ., data = trainData, kernel = "linear", type = "eps-regression")
## 
## 
## Parameters:
##    SVM-Type:  eps-regression 
##  SVM-Kernel:  linear 
##        cost:  1 
##       gamma:  0.02777778 
##     epsilon:  0.1 
## 
## 
## Number of Support Vectors:  12228
coef(svmModel_eps)
##        (Intercept)        ed_bachelor          ed_master          ed_doctor 
##       -0.139066544       -0.104691733       -0.053295831        0.055092885 
##         race_asian         race_white      race_hispanic         race_black 
##        0.108625940        0.081531877        0.025630440        0.013201953 
##   race_two_or_more  yearsofexperience     yearsatcompany           ed_other 
##        0.044532632        0.357583565       -0.049856955       -0.047258669 
##      gender_female        gender_male       gender_other         com_oracle 
##       -0.008572223        0.007496459        0.005333605       -0.016992092 
##          com_other         com_amazon          com_apple     com_salesforce 
##       -0.117764513        0.024842040        0.058950577        0.017955597 
##       com_facebook         com_google          com_intel      com_microsoft 
##        0.154430481        0.103931083       -0.047379113       -0.017624466 
##            com_ibm     title_pmanager        title_sweng     title_swengman 
##       -0.053749870        0.028654364        0.003724878        0.097217183 
##      title_datasci        title_other        title_hweng continent_namerica 
##       -0.007392478       -0.070825327       -0.008278636        0.157381473 
## continent_samerica   continent_africa  continent_oceania     continent_asia 
##       -0.027962620       -0.004981926        0.012755926       -0.150943598 
##   continent_europe 
##       -0.056632805
svmPred_eps <- predict(svmModel_eps, testData)
RMSE_eps <- sqrt(mean((svmPred_eps - Ytest)^2))
cat("\n Root Mean Squared Error:", RMSE_eps, "\n")
## 
##  Root Mean Squared Error: 94267.23
svmModel_nu_ra <- svm(Ytrain ~ ., data = trainData, kernel = "radial",type = "nu-regression")
summary(svmModel_nu_ra)
## 
## Call:
## svm(formula = Ytrain ~ ., data = trainData, kernel = "radial", type = "nu-regression")
## 
## 
## Parameters:
##    SVM-Type:  nu-regression 
##  SVM-Kernel:  radial 
##        cost:  1 
##          nu:  0.5 
## 
## Number of Support Vectors:  8034
svmPred_nu_ra <- predict(svmModel_nu_ra, testData)
RMSE_nu_ra <- sqrt(mean((svmPred_nu_ra - Ytest)^2))
cat("\n Root Mean Squared Error:", RMSE_nu_ra, "\n")
## 
##  Root Mean Squared Error: 90429.82
svmModel_eps_ra <- svm(Ytrain ~ ., data = trainData, kernel = "radial", type = "eps-regression")
summary(svmModel_eps_ra)
## 
## Call:
## svm(formula = Ytrain ~ ., data = trainData, kernel = "radial", type = "eps-regression")
## 
## 
## Parameters:
##    SVM-Type:  eps-regression 
##  SVM-Kernel:  radial 
##        cost:  1 
##       gamma:  0.02777778 
##     epsilon:  0.1 
## 
## 
## Number of Support Vectors:  11993
svmPred_eps_ra <- predict(svmModel_eps_ra, testData)
RMSE_eps_ra <- sqrt(mean((svmPred_eps_ra - Ytest)^2))
cat("\n Root Mean Squared Error:", RMSE_eps_ra, "\n")
## 
##  Root Mean Squared Error: 90905.04
svmModel_nu_po <- svm(Ytrain ~ ., data = trainData, kernel = "polynomial", type = "nu-regression")
summary(svmModel_nu_po)
## 
## Call:
## svm(formula = Ytrain ~ ., data = trainData, kernel = "polynomial", 
##     type = "nu-regression")
## 
## 
## Parameters:
##    SVM-Type:  nu-regression 
##  SVM-Kernel:  polynomial 
##        cost:  1 
##      degree:  3 
##      coef.0:  0 
##          nu:  0.5 
## 
## Number of Support Vectors:  8188
svmPred_nu_po <- predict(svmModel_nu_po, testData)
RMSE_nu_po <- sqrt(mean((svmPred_nu_po - Ytest)^2))
cat("\n Root Mean Squared Error:", RMSE_nu_po, "\n")
## 
##  Root Mean Squared Error: 91844.69
svmModel_eps_po <- svm(Ytrain ~ ., data = trainData, kernel = "polynomial", type = "eps-regression")
summary(svmModel_eps_po)
## 
## Call:
## svm(formula = Ytrain ~ ., data = trainData, kernel = "polynomial", 
##     type = "eps-regression")
## 
## 
## Parameters:
##    SVM-Type:  eps-regression 
##  SVM-Kernel:  polynomial 
##        cost:  1 
##      degree:  3 
##       gamma:  0.02777778 
##      coef.0:  0 
##     epsilon:  0.1 
## 
## 
## Number of Support Vectors:  12146
svmPred_eps_po <- predict(svmModel_eps_po, testData)
RMSE_eps_po <- sqrt(mean((svmPred_eps_po - Ytest)^2))
cat("\n Root Mean Squared Error:", RMSE_eps_po, "\n")
## 
##  Root Mean Squared Error: 92446.72
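The six SVR fits above differ only in their kernel and type arguments, so they can also be generated programmatically (a sketch on toy data; the notebook fits each model explicitly on trainData):

```r
library(e1071)

# All kernel/type combinations used in this section
grid <- expand.grid(kernel = c("linear", "radial", "polynomial"),
                    type   = c("nu-regression", "eps-regression"),
                    stringsAsFactors = FALSE)

# Fit one small SVR per combination on synthetic data
set.seed(1)
toy <- data.frame(x = rnorm(50))
toy$y <- 2 * toy$x + rnorm(50, sd = 0.1)
models <- Map(function(k, t) svm(y ~ x, data = toy, kernel = k, type = t),
              grid$kernel, grid$type)
nrow(grid)  # 6 combinations
```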

This graph compares the real salary values with the predictions of the four linear- and radial-kernel models:

# Create a data frame with the real salaries and the four models' predictions
dfgraph <- data.frame(salaries = Ytest, pred_nu = svmPred_nu, pred_eps = svmPred_eps,
                      pred_nu_ra = svmPred_nu_ra, pred_eps_ra = svmPred_eps_ra)

# Sort the data frame by the real salaries for plotting
dfgraph <- dfgraph[order(dfgraph$salaries), ]

# Line plot of the real salaries against each model's predictions
plot_ly(dfgraph, x = ~salaries) %>%
  add_lines(y = ~pred_nu, name = "Linear kernel + nu-regression", line = list(color = "red", width = 2)) %>%
  add_lines(y = ~pred_eps, name = "Linear kernel + eps-regression", line = list(color = "green", width = 2)) %>%
  add_lines(y = ~pred_nu_ra, name = "Radial kernel + nu-regression", line = list(color = "blue", width = 2)) %>%
  add_lines(y = ~pred_eps_ra, name = "Radial kernel + eps-regression", line = list(color = "yellow", width = 2)) %>%
  layout(xaxis = list(title = "Real salaries"), yaxis = list(title = "Predicted salaries"))

The model with the lowest RMSE is svmModel_nu_ra (radial kernel + nu-regression), followed closely by svmModel_eps_ra (radial kernel + eps-regression).

The residual plot of the best model is:

residuals <- Ytest - svmPred_nu_ra
plot(svmPred_nu_ra, residuals, main="Residual Plot", xlab="Predicted Values", ylab="Residuals")
abline(h = 0, col = "red")

And the Predicted vs Actual Values graph is:

lims <- range(c(Ytest, svmPred_nu_ra))
plot(Ytest, svmPred_nu_ra, main = "Predicted vs Actual Values (radial + nu regression)", xlab = "Actual Values", ylab = "Predicted Values", xlim = lims, ylim = lims)
abline(0, 1, col = "red")

To get an idea of how good or bad these results are, let's look at the summary of the output variable:

summary(Ytest)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10000  117000  174000  197249  245000 1733000

We calculate the R-squared value of our best model:

R2_nu_ra <- 1 - sum((Ytest - svmPred_nu_ra)^2) / sum((Ytest - mean(Ytest))^2)


cat("Root Mean Squared Error of SVR:", RMSE_nu_ra, "\n")
## Root Mean Squared Error of SVR: 90429.82
cat("R-squared coefficient of SVR:", R2_nu_ra, "\n")
## R-squared coefficient of SVR: 0.4755912

Regression Trees

We continue the analysis with regression trees and the random forest method:

full_tree <- rpart(Ytrain~., data=trainData, method='anova')
rpart.plot(full_tree)

For all the trees, we tried to fine-tune the parameters in order to obtain the lowest possible error while avoiding overfitting. In the end, the smallest error (RMSE) was obtained by the trees that kept the default parameters. Below we show one of our trials as an example:

## Attempts in  finding the cp parameter - the optimal one - for building the trees
## Calculated according to the lab 8


library(MLmetrics)

alt_full_tree <- rpart( Ytrain~., 
                     data = trainData, 
                     method='anova', 
                     control = rpart.control(
                       xval = 10, # number of cross-validations
                       minbucket = 2, # the minimum number of observations in any terminal (leaf) node
                       cp = 0.0)
)
optimalCp = alt_full_tree$cptable[which.min(alt_full_tree$cptable[,4]),1]
alt_full_tree_Optimal <- prune(alt_full_tree, cp=optimalCp)
#Plot the decision tree
rpart.plot(alt_full_tree_Optimal,type=3)

y_pred_tunedtree = predict(newdata=testData, object=alt_full_tree)
rmse_tunedtree = sqrt(MSE(y_pred_tunedtree, Ytest)) # ~100k
sprintf("RMSE for the tunedtree: %s", rmse_tunedtree)
## [1] "RMSE for the tunedtree: 103622.941400866"
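The cp selection above can be inspected with rpart's own diagnostics; here is the same pruning recipe on the built-in mtcars data (our own illustrative sketch; the notebook applies it to alt_full_tree):

```r
library(rpart)

# Grow a deliberately deep tree, then prune at the cp minimizing
# cross-validated error (the xerror column of the cp table)
fit <- rpart(mpg ~ ., data = mtcars, method = "anova",
             control = rpart.control(xval = 10, minbucket = 2, cp = 0.0))
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)

printcp(fit)  # tabulates cp against cross-validated error
plotcp(fit)   # visual elbow for choosing cp
```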

Moreover, we build a tree based only on the social features: race, gender, and education. Adjusting the default cp gave the best result:

social_tree <- rpart( Ytrain ~ ed_master+ed_bachelor+ed_doctor+ed_other+
                  race_asian+race_hispanic+race_white+race_black+race_two_or_more+
                  gender_female+gender_male+gender_other,
                  data=trainData, method='anova', control = rpart.control(cp = 0.005))
rpart.plot(social_tree)

Since education was the most influential of these features, we continue by removing it from the social features:

non_ed_tree <- rpart(Ytrain~race_asian+ race_hispanic+race_white+ race_black+ race_two_or_more+
                      gender_female+ gender_other + gender_male,
                     data=trainData, method='anova', control = rpart.control(cp = 0.0005))
rpart.plot(non_ed_tree)

As an extension, we also implement Random Forest analysis:

classifier_RF = randomForest(x = trainData, y= Ytrain, ntree=50, keep.forest=TRUE)
y_pred_RF = predict(newdata=testData, object=classifier_RF)

classifier_RT <- rpart(Ytrain~., data=trainData, method="anova" )
y_pred_RT = predict(newdata=testData, object=classifier_RT)

Comparing the chosen accuracy metric (RMSE) between regression tree and random forest:

rmse_RF = sqrt(MSE(y_pred_RF, Ytest)) # ~90k
rmse_RT = sqrt(MSE(y_pred_RT, Ytest)) # ~100k
R2_RF <- 1 - sum((Ytest - y_pred_RF)^2) / sum((Ytest - mean(Ytest))^2)
R2_RT <- 1 - sum((Ytest - y_pred_RT)^2) / sum((Ytest - mean(Ytest))^2)

Show all the metrics:

sprintf("RMSE for the regression tree method: %s", rmse_RT)
## [1] "RMSE for the regression tree method: 101918.807393375"
sprintf("R squared score for the regression tree method: %s", R2_RT)
## [1] "R squared score for the regression tree method: 0.333875655465425"
sprintf("RMSE for the random forest method: %s", rmse_RF)
## [1] "RMSE for the random forest method: 89600.6147020292"
sprintf("R squared score for the random forest method: %s", R2_RF)
## [1] "R squared score for the random forest method: 0.485164343137025"

Predicting our salaries using Random Forest

As aspiring data scientists, we wanted to know what our future salaries could look like. If properly implemented and tuned, this tool could be very useful for prospective STEM professionals to obtain a reference of what salary to ask for in job interviews, taking into account all variables discussed, including their education and professional careers. It can also be useful to choose a specialization, and a place to work to obtain the best possible salary.

Moreover, for members of any minority group, it is useful to be aware of the effect of prejudice on prospective salaries, in order to be able to ask for fair compensation.

We will begin by creating a dataframe with our characteristics:

# Ewa, Marina, Tymoteusz, Lucía
ed_bachelor <- c(0,0,0,0)
ed_doctor <- c(0,0,0,0)
ed_other <- c(0,0,0,0)
ed_master <- c(1,1,1,1)

race_asian <- c(0,0,0,0)
race_hispanic <- c(0,0,0,0)
race_black <- c(0,0,0,0)
race_two_or_more <- c(0,0,0,0)
race_white <- c(1,1,1,1)

yearsofexperience <- c(1,0,2,3)
yearsatcompany <- c(0,0,0,0)

gender_female <- c(1,1,0,1)
gender_male <- c(0,0,1,0)
gender_other <- c(0,0,0,0)

com_oracle <- c(0,0,0,0)
com_other <- c(1,0,1,1)
com_amazon <- c(0,0,0,0)
com_apple <- c(0,0,0,0)
com_salesforce <- c(0,0,0,0)
com_facebook <- c(0,0,0,0)
com_google <- c(0,0,0,0)
com_intel <- c(0,0,0,0)
com_microsoft <- c(0,0,0,0)
com_ibm <- c(0,0,0,0)

title_pmanager <- c(0,0,0,0)
title_sweng <- c(0,0,1,0)
title_swengman <- c(0,0,0,0)
title_datasci <- c(1,1,0,1)
title_other <- c(0,0,0,0)
title_hweng <- c(0,0,0,0)

continent_namerica <- c(0,0,0,0)
continent_samerica <- c(0,0,0,0)
continent_africa <- c(0,0,0,0)
continent_oceania <- c(0,0,0,0)
continent_asia <- c(0,0,0,0)
continent_europe <- c(1,1,1,1)


#We include the characteristics in a dataframe
teamcharacteristics <- data.frame(ed_bachelor,ed_master,ed_other,ed_doctor,race_asian,race_white,race_hispanic,race_black,race_two_or_more,yearsofexperience,yearsatcompany,gender_female,gender_other,com_oracle,com_other,com_amazon,com_apple,com_salesforce,com_facebook,com_google,com_intel,com_microsoft,com_ibm,title_pmanager,title_sweng,title_swengman,title_datasci,title_other,title_hweng,continent_namerica,continent_samerica,continent_africa,continent_oceania,continent_asia,continent_europe,gender_male)
teamcharacteristics
team_names <- c('Marina', 'Ewa', 'Tymo', 'Lucia')
y_team_salaries = predict(newdata = teamcharacteristics, object = classifier_RF)

teampredictions <- data.frame(team_names, y_team_salaries)
teampredictions
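Spelling every dummy column out by hand is error-prone; an alternative is to generate each profile row from a list of active columns (our own sketch; `onehot_row` is a hypothetical helper, not part of the notebook):

```r
# Build a one-hot profile row from a list of active dummy columns
# (onehot_row is our own hypothetical helper)
onehot_row <- function(all_cols, active, numeric_vals = list()) {
  row <- setNames(as.list(rep(0, length(all_cols))), all_cols)
  row[active] <- 1
  if (length(numeric_vals)) row[names(numeric_vals)] <- numeric_vals
  as.data.frame(row)
}

cols <- c("ed_master", "ed_bachelor", "gender_female", "gender_male",
          "continent_europe", "yearsofexperience")
profile <- onehot_row(cols,
                      active = c("ed_master", "gender_female", "continent_europe"),
                      numeric_vals = list(yearsofexperience = 1))
profile  # one row, zeros everywhere except the chosen columns
```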

Conclusions

As a summary, we gather all the metrics used to evaluate the models and compare their performance:

Model                                  RMSE         R-squared
Linear Regression                      91,757.71    0.4601
SVR (radial kernel + nu-regression)    90,429.82    0.4756
Regression Tree                        101,918.80   0.3339
Random Forest                          89,600.61    0.4852

The average salary for all the people in our dataset is 197,425 USD. Linear Regression, SVR and Random Forest all have an RMSE of roughly 90k, which corresponds to an average error of about 46% of the mean salary, and the Regression Tree is worse still at around 102k (about 52%). Given such high relative errors, none of these models is yet an accurate standalone salary predictor.
However, the Random Forest achieves the lowest RMSE (around 89.6k) and the highest R-squared (0.485), making it the best of the four models. With further tuning and richer features, we see real value in it for users and a potential business application.
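Relative prediction error can be expressed as each model's RMSE divided by the mean salary (our own arithmetic check using the values reported above):

```r
# Relative error = RMSE / mean salary, as a percentage
mean_salary <- 197425
rmse_values <- c(linear = 91757.71, svr = 90429.82,
                 tree = 101918.81, forest = 89600.61)
round(100 * rmse_values / mean_salary, 1)  # 46.5, 45.8, 51.6, 45.4 percent
```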
